Origin of the Native Driving Force for Protein Folding

We derive an expression with four adjustable parameters that reproduces
well the 20 x 20 Miyazawa-Jernigan potential matrix extracted from known
protein structures. The numerical values of the parameters can be
approximately computed from the surface tension of water, water-screened
dipole interactions between residues and water and among residues, and
average exposures of residues in folded proteins.



PACS numbers: 87.15.By, 02.10.Sp, 64.75.+g 

Protein structure and design is a very important topic in life science,
one in which physics and mathematics are indispensable. Recently Li et al.
pointed out some highly interesting and unexpected properties of Miyazawa
and Jernigan's 20 x 20 potential matrix (M) for protein structure. This
matrix, whose elements are statistically deduced pairwise interaction
potential energies among the twenty types of amino acids in proteins of
known structure, has been widely applied to protein design and folding
simulations. Li et al. noticed that M has a highly accurate leading
principal-component representation: variations of the elements of M from
their mean can be expressed in terms of only the two leading eigenvalues of
M and the eigenvector q of the leading eigenvalue such that

$$M_{ij} \approx c_2 q_i q_j + c_1 (q_i + q_j) + c_0, \qquad (1)$$

where i and j label the 20 amino acids, and $c_0 = -1.38$, $c_1 = 5.03$
and $c_2 = -7.40$, in units of RT, the gas constant times (room)
temperature.

Two features of the right-hand side of Eq. (1) stand out: 1) Not all
residue-dependent terms are genuine two-body interactions; the $c_1$ terms
represent one-body, mean-field potential energies. 2) Both the two-body
$c_2$ terms and the one-body $c_1$ terms depend on the same set of q's.
Numerically, because the magnitudes of the q's are small, the $c_1$ terms
dominate over the $c_2$ term. This is consistent with the widely held
notion that the earliest and fastest part of a protein folding process is
by and large controlled by the hydrophobicity of the residues. Tables I and
II show that q is indeed moderately correlated with the hydrophobicities
($\Delta G$). The product, pairwise form of the two-body terms is
reminiscent of dipole-dipole interaction, which in turn would imply a
connection between the one-body terms and the dipole moments of the
residues. Tables I and II also show a noticeable correlation between q and
the dipole moments (Q) of the side-chains of the residues. In the rest of
the paper we will derive an expression for the MJ matrix in terms of an
average "bare" residual solvation energy (for a hypothetical residue with
vanishing dipole), interactions between the dipole moments of the residues
and water molecules, and the degree of exposure to water (expressed as its
complement, the burial factor) of a residue in a folded protein. We show
that, except for the burial factor of the residues, the other three
adjustable parameters appearing in the expression all have clear physical
meanings with numerical values that can be computed approximately. The
average burial factors for hydrophobic and hydrophilic residues that emerge
from our analysis of the MJ matrix are 0.8 and 0.2, respectively (they are
related and should approximately sum to 1). In this paper energy will be
given in units of RT = 0.60 kcal/mol = $4.2 \times 10^{-21}$ J and dipole
moments will be given in Debyes (D).
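
As a rough numerical illustration of the two features noted above, the following Python sketch evaluates the right-hand side of Eq. (1) for a few residue pairs and compares the one-body and two-body contributions. The q values are those listed in Table I; the function and variable names are illustrative assumptions and not part of the original analysis.

```python
# Sketch: evaluate Eq. (1), M_ij ~ c0 + c1*(q_i + q_j) + c2*q_i*q_j (units of RT).
C0, C1, C2 = -1.38, 5.03, -7.40   # coefficients quoted with Eq. (1)

# Eigenvector components q for a few residues (taken from Table I).
Q_EIGEN = {"Leu": -0.443, "Phe": -0.438, "Ser": -0.011, "Lys": 0.065}


def one_body(qi, qj):
    """Mean-field (one-body) part c1*(q_i + q_j)."""
    return C1 * (qi + qj)


def two_body(qi, qj):
    """Genuine pairwise part c2*q_i*q_j."""
    return C2 * qi * qj


def mj_approx(res_i, res_j):
    """Right-hand side of Eq. (1) for the named residue pair."""
    qi, qj = Q_EIGEN[res_i], Q_EIGEN[res_j]
    return C0 + one_body(qi, qj) + two_body(qi, qj)


if __name__ == "__main__":
    for a, b in [("Leu", "Leu"), ("Leu", "Phe"), ("Leu", "Lys"), ("Lys", "Lys")]:
        qi, qj = Q_EIGEN[a], Q_EIGEN[b]
        print(f"{a}-{b}: M = {mj_approx(a, b):6.2f}, "
              f"one-body = {one_body(qi, qj):6.2f}, "
              f"two-body = {two_body(qi, qj):6.2f}")
```

For the Leu-Leu pair this gives about -7.29 RT, with the one-body part (about -4.46) roughly three times larger in magnitude than the two-body part (about -1.45).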

Dipole-dipole interaction. The interaction in vacuum between two electric
dipoles $\mathbf{Q}_i$ and $\mathbf{Q}_j$ separated by
$\mathbf{R}_{ij} = \hat{n} R_{ij}$ is

$$V_{ij} = \left(\mathbf{Q}_i \cdot \mathbf{Q}_j - 3(\hat{n}\cdot\mathbf{Q}_i)(\hat{n}\cdot\mathbf{Q}_j)\right)/(4\pi\epsilon_0 R_{ij}^3).$$

If the carriers of the dipoles are relatively unconstrained we expect
attraction and $-|\mu_r| Q_i Q_j < V_{ij} < 0$, where
$|\mu_r| = D^2/2\pi\epsilon_0 R_{ij}^3$. In what follows, $Q_i$,
$i = 1, \ldots, 20$, is the dipole moment of the i-th side-chain, and $Q_w$
is the dipole moment of a water molecule. For residue-residue interaction,
taking the inter-side-chain distance to be $R_{ij} = R_0 = 6.5$ Å, and
recalling that an electron-positron pair separated by one Å has a dipole
moment of 4.8 D, we have $|\mu_r| \approx 0.172$ (RT), which may be viewed
as a maximum value for the coupling, since in a real setting it is expected
to be weakened owing to the presence of water molecules.
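
The estimate of the bare coupling can be checked with a few lines of Python. This is a minimal sketch that assumes the expression $|\mu_r| = D^2/2\pi\epsilon_0 R^3$ with dipoles measured in Debye, the text's conversion of 4.8 D per unit charge separated by one Å, and RT = $4.2 \times 10^{-21}$ J; the physical constants are standard SI values and the function name is illustrative.

```python
import math

E_CHARGE = 1.602176634e-19      # elementary charge, C
EPS0 = 8.8541878128e-12         # vacuum permittivity, F/m
ANGSTROM = 1.0e-10              # m
RT_JOULE = 4.2e-21              # energy unit used in the text, J

# The text's conversion: a unit-charge pair separated by 1 Angstrom is 4.8 D.
DEBYE = E_CHARGE * ANGSTROM / 4.8   # one Debye in C*m


def bare_coupling(distance_angstrom):
    """|mu_r| = D^2 / (2*pi*eps0*R^3), in units of RT per Debye squared."""
    r = distance_angstrom * ANGSTROM
    return DEBYE**2 / (2.0 * math.pi * EPS0 * r**3) / RT_JOULE


if __name__ == "__main__":
    print(f"|mu_r| at 6.5 A: {bare_coupling(6.5):.3f} RT/D^2")
```

With these inputs the sketch gives roughly 0.17 RT, in line with the value of 0.172 quoted above; the small difference presumably reflects rounding of the constants.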

One-body terms. Let $E_0$ be the average bare surface-dependent solvation
energy of a residue in water when the residue-water dipole interaction is
not taken into account; $N_w$ be the average number of water molecules in
contact with a residue; $\mu_w$ be the average effective dipole-dipole
coupling between the i-th residue and a water molecule. Then, with
residue-water interaction energy included and possible dependence of $E_0$,
$\mu_w$ and $N_w$ on i ignored, the residue-water interaction energy is
$E_i = \mu_w Q_i Q_w N_w + E_0 = \mu_w Q_i^* Q_w N_w$, where for
convenience we write $Q_i^* = Q_i + Q_0$ and $Q_0 = E_0/(\mu_w Q_w N_w)$.
A hydrophobic (hydrophilic) residue would have $E_i > 0$ ($E_i < 0$). If
$N_i$ is the number of residues of type i in a peptide, then the energy of
an unfolded peptide in water is $U = \sum_i N_i E_i$. Suppose that after
folding $\Delta N_i$ fewer residues of type i are exposed to water. Then
the binding energy of the folded relative to the unfolded state is
$\Delta U = -\sum_i \Delta N_i E_i$. The negative sign means that in
folding, the peptide will maximize (minimize) those $\Delta N_i$ whose
$E_i$ are the most positive (negative), subject to the constraint of the
polymeric nature of the peptide.
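
The one-body picture can be sketched numerically as follows. This is only an illustration: it borrows the parameter values that the paper quotes further on ($\mu_w \approx -0.076$, $Q_w = 1.85$ D, $N_w \approx 35$, $Q_0 = -2.9$), takes dipole moments from Table I, and uses a made-up burial pattern for a toy peptide. All function names are hypothetical.

```python
# Sketch of the one-body picture: E_i = mu_w * (Q_i + Q_0) * Q_w * N_w (units of RT).
MU_W = -0.076   # effective water-residue coupling (value quoted later in the text)
Q_W = 1.85      # dipole moment of a water molecule, D
N_W = 35        # average number of water contacts per residue (text's estimate)
Q_0 = -2.9      # shift Q_0 = E_0 / (mu_w * Q_w * N_w), as quoted in Eq. (4)

# Side-chain dipole moments Q_i in Debye for a few residues (Table I).
DIPOLE = {"Leu": 0.006, "Val": 0.021, "Ser": 2.40, "Lys": 8.09}


def solvation_energy(residue):
    """Residue-water interaction energy E_i, including the bare term E_0."""
    return MU_W * (DIPOLE[residue] + Q_0) * Q_W * N_W


def binding_energy(buried_counts):
    """Delta U = -sum_i Delta N_i * E_i for the given numbers of buried residues."""
    return -sum(n * solvation_energy(res) for res, n in buried_counts.items())


if __name__ == "__main__":
    for res in DIPOLE:
        print(f"E({res}) = {solvation_energy(res):7.2f} RT")
    # Hypothetical burial pattern, purely for illustration.
    toy = {"Leu": 4, "Val": 3, "Ser": 1, "Lys": 0}
    print(f"Delta U (toy peptide) = {binding_energy(toy):.1f} RT")
```

Residues with small dipoles come out with $E_i > 0$ and the strongly polar Lys with $E_i < 0$, so burying the former lowers $\Delta U$, as described above.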
FIG. 1. (a) $\xi_i Q_i^*$ vs. $c_1 q_i^*$. (b) The residue dipole-dipole
interaction vs. the two-body term in the MJ matrix. (c) The right-hand side
of Eq. (7) vs. the complete MJ matrix.

Relation between q and Q. Equating $\Delta U$ with the binding energy
obtained from Eq. (1) by summing the one-body terms over all pairs we have

$$\Delta U \approx c_1 N_c \sum_i N_i q_i^* = -\mu_w Q_w N_w \sum_i Q_i^* \Delta N_i, \qquad (2)$$

where $q_i^* = q_i - q_0$, $q_0$ is a constant and $N_c$ is the average
number of contacts a residue has in a folded state. Matching the
i-dependent terms we have

$$c_1 q_i^* \propto \xi_i Q_i^*, \qquad \xi_i = -\mu_w (\Delta N_i/N_i)(N_w Q_w/N_c). \qquad (3)$$

Because in a folded protein proportionally more hydrophobic (h) residues
than polar (p) residues will be hidden from water, one expects
$\Delta N_i/N_i$, hence $\xi_i$, to have a strong residual dependence. To
minimize the number of parameters we allow $\xi_i$ to have only two values,
$\xi_h$ and $\xi_p$, and have them determined by separate linear fits to
q's belonging to hydrophobic and hydrophilic residues, respectively.
Excluded in the fits are residues whose hydrophobicities are ambivalent:
Tyr, Ala, Gly, Thr, Ser and Pro. Demanding that the two fits have the same
intercepts we obtain

$$q_0 = -0.055, \quad Q_0 = -2.9; \quad \xi_h = 0.56, \quad \xi_p = 0.14. \qquad (4)$$

The linear correlation between q and $\xi Q^*$ over the complete set of 20
residues (with the first eight amino acids in Table I taken to be
hydrophobic) is 0.949, which is dramatically better than the correlation
between q and Q; see Fig. 1(a) and Table II.
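
The improvement in correlation can be reproduced approximately from the entries of Table I. The sketch below is an illustration only: it recomputes $\xi_i Q_i^*$ from the rounded values of Eq. (4) rather than reading it from the table, uses a plain Pearson correlation, and assumes that the first eight residues are the hydrophobic ones. The helper names are hypothetical.

```python
import math

# Table I: eigenvector components q and side-chain dipole moments Q (Debye),
# in the order Cys, Met, Phe, Ile, Leu, Val, Trp, Tyr, Ala, Gly,
#              Thr, Ser, Asn, Gln, Asp, Glu, His, Arg, Lys, Pro.
Q_EIGEN = [-0.265, -0.327, -0.438, -0.390, -0.443, -0.315, -0.298, -0.226,
           -0.125, -0.048, -0.058, -0.011, -0.011, -0.023, 0.040, 0.028,
           -0.107, -0.020, 0.065, -0.054]
DIPOLE = [0.540, 0.218, 0.393, 0.046, 0.006, 0.021, 0.762, 2.40,
          0.00, 0.00, 2.39, 2.40, 4.03, 3.81, 4.29, 6.08,
          2.85, 4.90, 8.09, 1.40]

XI_H, XI_P = 0.56, 0.14   # Eq. (4)
Q_0 = -2.9                # Eq. (4)
N_HYDROPHOBIC = 8         # first eight residues are taken to be hydrophobic


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def xi_q_star():
    """xi_i * (Q_i + Q_0), with xi_h for hydrophobic and xi_p for other residues."""
    return [(XI_H if k < N_HYDROPHOBIC else XI_P) * (q_dip + Q_0)
            for k, q_dip in enumerate(DIPOLE)]


if __name__ == "__main__":
    print(f"corr(q, Q)     = {pearson(Q_EIGEN, DIPOLE):.3f}")
    print(f"corr(q, xi Q*) = {pearson(Q_EIGEN, xi_q_star()):.3f}")
```

The two correlations come out near 0.75 and 0.95 respectively, close to the values listed in Table II.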

The burial factor. Since on average the numbers of hydrophobic and polar
residues in a protein are approximately equal and about half of all
residues are buried in the core, we have $N_h \approx N_p$,
$\Delta(N_h + N_p)/(N_h + N_p) \approx 1/2$ and hence
$\Delta N_p/N_p \approx 1 - \Delta N_h/N_h$. From the ratio of the two
$\xi$'s we thus deduce the burial factors for hydrophobic and polar
residues, respectively, to be

$$\Delta N_h/N_h \approx 0.80, \qquad \Delta N_p/N_p \approx 0.20. \qquad (5)$$

That is, our analysis of the MJ matrix suggests that on average four times
as many hydrophobic residues as polar residues are buried in the core.
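
The arithmetic behind Eq. (5) is short enough to spell out. The sketch below assumes, as in the text, that $\xi_i$ is proportional to $\Delta N_i/N_i$ and that the two burial factors sum to one; the function name is illustrative.

```python
def burial_factors(xi_h, xi_p):
    """Solve b_h / b_p = xi_h / xi_p together with b_h + b_p = 1."""
    total = xi_h + xi_p
    return xi_h / total, xi_p / total


if __name__ == "__main__":
    b_h, b_p = burial_factors(0.56, 0.14)   # xi values from Eq. (4)
    print(f"Delta N_h / N_h = {b_h:.2f}, Delta N_p / N_p = {b_p:.2f}")
```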

Two-body terms. We define the true two-body part of the MJ matrix to be
the matrix minus the one-body and constant part of Eq. (1):
$M_{ij} - c_0 - c_1(q_i + q_j)$. This two-body part is again well
approximated by $c_2' q_i q_j$, $c_2' = -10.7$, with which it has a linear
correlation of 0.832. When $c_2' q_i q_j$ is re-expressed in terms of $Q^*$
using Eq. (3), the shift $q_0$ induces an additional one-body term such
that

$$M_{ij} - c_1'(q_i + q_j) \approx C_2\, \xi_i \xi_j Q_i^* Q_j^* + \mathrm{const.}, \qquad (6)$$

where $C_2 = c_2'/c_1^2 = -0.423$ and $c_1' = c_1 + c_2' q_0 = 5.62$. The
linear correlation between $M_{ij} - c_1'(q_i + q_j)$ and
$\xi_i \xi_j Q_i^* Q_j^*$ is 0.681; see Fig. 1(b). Given that the dipole
moments and $\xi_h$ and $\xi_p$ are predetermined, the first term on the
right-hand side of Eq. (6) is a one-free-parameter ($C_2$) fit to 210
pieces of "noise" in the MJ matrix. The mediocre quality of the correlation
nevertheless suggests that the two-body term cannot be explained by dipole
interactions alone; interactions depending on charge and polarizability may
need to be included. The inclusion of such terms may cause the two-body
term to deviate from the simple qq form suggested by Eq. (1). Owing to its
relatively small magnitude, such a deviation should be tolerable to the
original MJ matrix.
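
The two derived coefficients can be checked directly. The sketch below assumes $q_i^* = q_i - q_0$ with $q_0 = -0.055$, so that expanding $c_2' q_i q_j$ about $q_0$ adds $c_2' q_0 (q_i + q_j)$ to the one-body term; with this sign convention the quoted value of 5.62 is recovered. The function names are illustrative.

```python
C1 = 5.03          # one-body coefficient of Eq. (1)
C2_PRIME = -10.7   # fitted two-body coefficient c_2'
Q0_SHIFT = -0.055  # q_0 from Eq. (4)


def rescaled_two_body(c2_prime=C2_PRIME, c1=C1):
    """C_2 = c_2' / c_1^2, the coefficient of xi_i xi_j Q_i^* Q_j^*."""
    return c2_prime / c1**2


def renormalized_one_body(c1=C1, c2_prime=C2_PRIME, q0=Q0_SHIFT):
    """c_1' from c_2' q_i q_j = c_2' q_i^* q_j^* + c_2' q_0 (q_i + q_j) + const."""
    return c1 + c2_prime * q0


if __name__ == "__main__":
    print(f"C_2  = {rescaled_two_body():.3f}")
    print(f"c_1' = {renormalized_one_body():.2f}")
```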

MJ matrix in terms of $Q^*$. Re-expressing the one-body term in Eq. (6) in
terms of $Q^*$ and rationalizing notations by writing
$\mu_{ij} = C_2 \xi_i \xi_j$ and $\xi_i' = \xi_i c_1'/c_1$ we finally have

$$M_{ij} \approx \mu_{ij} Q_i^* Q_j^* + \xi_i' Q_i^* + \xi_j' Q_j^* + \mathrm{const.}, \qquad (7)$$

where $\mu_{hh} = -0.13$, $\mu_{hp} = -0.032$, $\mu_{pp} = -0.0078$,
$\xi_h' = 0.63$ and $\xi_p' = 0.15$. The two sides of the equation have a
linear correlation of 0.922; see Fig. 1(c). Since $Q_i$ is either zero or
positive, the negative values of $\mu_{ij}$ imply that the dipoles mostly
succeed in causing the residues to lower their energies. That is, even in a
folded state the residues appear to be sufficiently unrestricted to find
optimum orientations. To the extent that the dipole moments of the
side-chains are not free parameters, the expression on the right-hand side
is a four-parameter fit ($C_2$, $E_0$, $\Delta N_h/N_h$ and $\mu_w$; see
below) to the complete MJ matrix.
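
The coefficients of Eq. (7) follow from quantities already quoted, as the following sketch shows. It assumes the rounded inputs $C_2 = -0.423$, $c_1 = 5.03$, $c_1' = 5.62$, $\xi_h = 0.56$ and $\xi_p = 0.14$, and $Q_0 = -2.9$, so the last digits can differ slightly from the values in the text. The class labels "h" and "p" and the function names are illustrative.

```python
C2_RESCALED = -0.423            # C_2 = c_2' / c_1^2
C1, C1_PRIME = 5.03, 5.62       # bare and renormalized one-body coefficients
XI = {"h": 0.56, "p": 0.14}     # xi_h and xi_p from Eq. (4)
Q_0 = -2.9                      # dipole shift from Eq. (4), in Debye


def pair_coupling(class_i, class_j):
    """mu_ij = C_2 * xi_i * xi_j for residue classes 'h' or 'p'."""
    return C2_RESCALED * XI[class_i] * XI[class_j]


def one_body_coeff(class_i):
    """xi_i' = xi_i * c_1' / c_1."""
    return XI[class_i] * C1_PRIME / C1


def mj_from_dipoles(q_i, class_i, q_j, class_j):
    """Right-hand side of Eq. (7) without the additive constant (units of RT)."""
    qs_i, qs_j = q_i + Q_0, q_j + Q_0
    return (pair_coupling(class_i, class_j) * qs_i * qs_j
            + one_body_coeff(class_i) * qs_i
            + one_body_coeff(class_j) * qs_j)


if __name__ == "__main__":
    for pair in [("h", "h"), ("h", "p"), ("p", "p")]:
        print(f"mu_{pair[0]}{pair[1]} = {pair_coupling(*pair):.4f}")
    print(f"xi_h' = {one_body_coeff('h'):.3f}, xi_p' = {one_body_coeff('p'):.3f}")
    # Leu (Q = 0.006 D, hydrophobic) with Lys (Q = 8.09 D, polar), from Table I.
    print(f"Leu-Lys, Eq. (7) minus const: {mj_from_dipoles(0.006, 'h', 8.09, 'p'):.2f}")
```

The computed couplings agree with the quoted $\mu_{hh}$, $\mu_{hp}$, $\mu_{pp}$, $\xi_h'$ and $\xi_p'$ to within the rounding of $\xi_p$.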

Residue-residue dipole coupling. By definition
$\mu_{ij} \propto (\Delta N_i/N_i)(\Delta N_j/N_j)$. With $\Delta N_i/N_i$
describing the fraction of buried residues in a folded protein, the
inequalities $|\mu_{pp}| < |\mu_{hp}| < |\mu_{hh}| < |\mu_r|$ correctly
take into account the dielectric property of water: the coupling between
residues shielded from water is stronger than that between residues that
are not. The magnitude of the weighted average of the residue-residue
coupling, $\bar{\mu}_{ij} = (7\mu_{pp} + 6\mu_{hp} + 7\mu_{hh})/20 = -0.041$,
is about four times smaller than the bare coupling strength
$|\mu_r| = 0.172$.

Water-residue coupling. We can obtain the effective water-residue coupling
from the relation $\xi_i = -\mu_w (\Delta N_i/N_i)(N_w Q_w/N_c)$ given
earlier. Using the value 6.5 Å for the average effective diameter of a
residue and the value 2 Å for the diameter of a water molecule, we estimate
that a residue may have a maximum of 12 residue contacts and 57 water
molecule contacts. In practice the number of contacts is reduced by the
presence of the peptide backbone and geometric constraints, such that in
fact $N_c \approx 7$. We therefore scale $N_w$ down to $\approx 35$. With
$Q_w = 1.85$ D, we deduce from Eqs. (3)-(5) that $\mu_w \approx -0.076$
(RT). The negative sign of $\mu_w$ is consistent with the notion that the
presence of a dipole in a residue reduces its hydrophobicity. Taking the
average water-residue distance to be 4.25 Å we expect the bare
water-residue coupling to be $(6.5/4.25)^3 \approx 3.5$ times stronger than
the bare residue-residue coupling. However, in an unfolded state the
residues are completely exposed to water. We therefore expect the
approximate relations
$|\mu_{pp}| < |\mu_w|/3.5 \approx |\mu_{hp}| \approx |\bar{\mu}_{ij}| < |\mu_r|$,
which are satisfied.
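
The value of $\mu_w$ follows from inverting the relation for $\xi_i$. The sketch below uses the rounded inputs quoted in the text ($\xi_h = 0.56$, burial factor 0.8, $N_w = 35$, $Q_w = 1.85$ D, $N_c = 7$); the function names are illustrative.

```python
def water_residue_coupling(xi, burial, n_w, q_w, n_c):
    """Invert xi = -mu_w * burial * (n_w * q_w / n_c) for mu_w (units of RT)."""
    return -xi * n_c / (burial * n_w * q_w)


def distance_scaling(r_residue=6.5, r_water=4.25):
    """Ratio of bare couplings at the two distances, assuming a 1/R^3 law."""
    return (r_residue / r_water) ** 3


if __name__ == "__main__":
    mu_w = water_residue_coupling(xi=0.56, burial=0.8, n_w=35, q_w=1.85, n_c=7)
    print(f"mu_w = {mu_w:.3f} RT")
    print(f"distance scaling = {distance_scaling():.2f}")
    print(f"|mu_w| / 3.5 = {abs(mu_w) / 3.5:.3f} RT")
```

Because $\xi_h/\xi_p$ equals the ratio of the burial factors, the same $\mu_w$ is obtained from the polar values $\xi_p = 0.14$ and 0.2.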

Solvation energy, surface tension and hydrophobicity. With $\mu_w$ and
$Q_0$ extracted from the data we now find the bare solvation energy to be
$E_0 = \mu_w Q_0 Q_w N_w = 14.6$ RT. Although hydration is an exceedingly
complex process and is not fully understood, the effective surface tension
of water, or surface free energy cost to water forced to sit against a
hydrophobic surface, has been estimated to be $\sigma \approx 40$
erg/cm$^2$. For a residue of diameter $R_0$ the free energy cost is
$W = 4\pi (R_0/2)^2 \sigma = 13$ RT, which is reasonably close to the value
of $E_0$. The fact that a good fit to the MJ matrix demands that $E_0$
enters $\Delta U$ in Eq. (2) multiplied by $\Delta N_i$ is an indication
that $E_0$ needs to be a surface energy. When the water-residue dipole
interaction energy is included, the total solvation energies $E_i$ of the
residues then separate into groups with distinct hydrophobicities, with the
seven most hydrophobic (hydrophilic) having an average solvation energy of
13.2 RT (-9.3 RT).
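
These estimates can be reproduced approximately with the rounded parameters given in the text. The sketch below is illustrative: it assumes that the seven most hydrophobic residues are the first seven of Table I (Cys through Trp) and the seven most hydrophilic are Asn, Gln, Asp, Glu, His, Arg and Lys, and it converts 40 erg/cm$^2$ to 0.040 J/m$^2$. The function names are hypothetical.

```python
import math

MU_W, Q_W, N_W, Q_0 = -0.076, 1.85, 35, -2.9   # values quoted in the text
RT_JOULE = 4.2e-21                             # energy unit used in the text, J
SIGMA = 40e-3                                  # 40 erg/cm^2 expressed in J/m^2
R_0 = 6.5e-10                                  # residue diameter, m

# Side-chain dipole moments from Table I (Debye); the grouping is an assumption.
HYDROPHOBIC_Q = [0.540, 0.218, 0.393, 0.046, 0.006, 0.021, 0.762]  # Cys..Trp
HYDROPHILIC_Q = [4.03, 3.81, 4.29, 6.08, 2.85, 4.90, 8.09]         # Asn..Lys


def bare_solvation_energy():
    """E_0 = mu_w * Q_0 * Q_w * N_w, in units of RT."""
    return MU_W * Q_0 * Q_W * N_W


def surface_cost():
    """W = 4*pi*(R_0/2)^2 * sigma, converted to units of RT."""
    return 4.0 * math.pi * (R_0 / 2.0) ** 2 * SIGMA / RT_JOULE


def mean_solvation_energy(dipoles):
    """Average of E_i = mu_w * (Q_i + Q_0) * Q_w * N_w over a group of residues."""
    return sum(MU_W * (q + Q_0) * Q_W * N_W for q in dipoles) / len(dipoles)


if __name__ == "__main__":
    print(f"E_0 = {bare_solvation_energy():.1f} RT")
    print(f"W   = {surface_cost():.1f} RT")
    print(f"<E> hydrophobic = {mean_solvation_energy(HYDROPHOBIC_Q):.1f} RT")
    print(f"<E> hydrophilic = {mean_solvation_energy(HYDROPHILIC_Q):.1f} RT")
```

With these rounded inputs the sketch gives values within a few tenths of RT of those quoted above; the residual differences presumably come from rounding of the fitted parameters, $Q_0$ in particular.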

Very recently Keskin et al. re-analyzed the MJ matrix and derived the
approximation (for ease of discussion the $W^*$ used here has an additional
negative sign relative to that in their work):
$M_{ij} \cong \Delta W^*_{ij} + W^*_i + W^*_j + \mathrm{const.}$, where the
one-body term $W^*_i$ is essentially defined as the mean-field of $M_{ij}$
and $\Delta W^*_{ij}$ is a four-parameter fit to $M_{ij}$ minus its
mean-field. The analysis confirms the dominance of the one-body term in the
MJ matrix. The overall fit to the MJ matrix, with a correlation of 0.99, is
excellent and the fit to the two-body part is about the same as that given
by the dipole picture: the correlation between $M_{ij} - W^*_i - W^*_j$ and
$\Delta W^*_{ij}$ is 0.67. Not surprisingly $W^*$ and q are closely
related. The expression $\eta c_1 q + 1.16$, with scale factor
$\eta = 1.17$, reproduces $W^*$ with a linear correlation of 0.997. The
value of $\eta$ is mostly explained by the fact that the mean-field
calculated from the right-hand side of Eq. (1) is $1.22 c_1 (q_i + q_j)$.
Incidentally, $\eta c_1 = 5.89$ is very close to the value of the
renormalized coefficient $c_1' = 5.62$ appearing in Eq. (6).
In Table I are listed values for q, Q, $W^*$, $\xi Q^*$, and hydropathy
scales $\Delta G$ (in units of RT) corrected for self-solvation for the
side-chains of the twenty amino acids. Recall that $\xi$ contains the
burial factor (see its definition in Eq. (3)) and $Q^*$ is Q shifted by
$Q_0$, an amount proportional to $E_0$. The pairwise linear correlations of
the entries in Table I are given in column 2 of Table II. The correlation
between $\xi Q^*$ and $W^*$ (and q) is very significantly better than that
between Q and $W^*$ (and q). The linear relations connecting the solvation
energy with $W^*$ and q,

$$-E_i (\Delta N_i/N_i)/N_c = \xi_i Q_i^* \cong c_1 (q_i - q_0) \cong (W_i^* - W_0^*)/\eta,$$

where $W_0^* = 0.71$ is a shift, highlight the importance of taking into
account the burial factor of a residue in a folded protein when
interpreting the one-body terms of the MJ matrix.

The hydropathy scales shown in Table I are derived for side-chains in model
peptides rather than in proteins. They include the effect of self-solvation
that reduces the hydropathies of the polar side-chains, but do not include
the effect of the burial factor. This probably explains why, as seen in
Table II, the $\Delta G$-q, $\Delta G$-$W^*$, $\Delta G$-Q and
$\Delta G$-$\xi Q^*$ correlations are of similar quality.

The q and $W^*$ values of proline suggest it to be polar, while its Q,
$\xi Q^*$ and $\Delta G$ values say it is ambivalent or even hydrophobic.
The third column in Table II shows that the correlations listed either
remain unchanged or improve when proline is excluded from the linear fit.
The ambiguous hydrophobicity of this residue may be related to the fact
that it has a looping structure.



We summarize our interpretation of Eq. (1) being a good approximation of
the MJ matrix as follows. The one-body part, or hydrophobicity (or
hydropathy) energy, is made up of two parts: the free energy cost to water
to accommodate the residue surface, and the attractive dipole interaction
between residue and water. Because polar residues have large dipole
moments, hydrophobic residues have small or no moments and ambivalent
residues have something in between, the hydropathic/hydrophobic energy is
strongly attractive, weakly attractive and strongly repulsive for polar,
ambivalent and hydrophobic residues, respectively. Residue-residue dipole
interactions account for a sizable portion, but not all, of the two-body
part. Aside from using the given dipole moments for the residues and having
two burial factors, one each for the hydrophobic and polar residues, no
residue-dependent adjustments were made in deriving Eq. (7), our rendition
of Eq. (1). That is, we have not attempted a detailed fit of the MJ matrix.
The correlation between the dipoles of the residues and q becomes
unequivocal and the strengths of the dipole couplings extracted from the MJ
matrix become reasonable only when the burial factors are included in the
formulation. That the factor is important reveals the dynamical nature of
protein folding: strengths of interactions change as the folding
progresses. Protein folding is a very complicated process that depends on
many details and the MJ matrix does not tell its whole story. It does
however contain the most basic structural information at the molecular
level of those proteins whose structures are known. The success of the
present analysis in understanding the main features of the MJ matrix gives
us confidence that the model used here may provide a starting point for
building a true potential suitable for use in a molecular dynamical
description of the early folding of a protein in water.
TABLE I. Values for q, Q (in Debye), $W^*$, $\xi Q^*$, and $\Delta G$
(self-solvation corrected hydrophobicities, in units of RT); see text.

Res.      q        Q       W*     xi Q*   Delta G
Cys    -0.265    0.540   -0.246   -1.36    -3.33
Met    -0.327    0.218   -0.707   -1.54    -2.78
Phe    -0.438    0.393   -1.512   -1.44    -5.40
Ile    -0.390    0.046   -1.087   -1.63    -5.03
Leu    -0.443    0.006   -1.502   -1.66    -5.03
Val    -0.315    0.021   -0.633   -1.65    -3.63
Trp    -0.298    0.762   -0.656   -1.23    -4.77
Tyr    -0.226    2.40    -0.355   -0.315    1.63
Ala    -0.125    0.00     0.531   -0.403   -1.12
Gly    -0.048    0.00     0.845   -0.403    0.00
Thr    -0.058    2.39     0.828   -0.078   -0.70
Ser    -0.011    2.40     1.076   -0.076    0.17
Asn    -0.011    4.03     1.104    0.145    3.78
Gln    -0.023    3.81     1.038    0.116    3.53
Asp     0.040    4.29     1.302    0.180    2.62
Glu     0.028    6.08     1.334    0.424    2.97
His    -0.107    2.85     0.429   -0.014    1.82
Arg    -0.020    4.90     1.043    0.264    6.48
Lys     0.065    8.09     1.648    0.697    4.10
Pro    -0.054    1.40     0.907   -0.212   -2.92

TABLE II. Linear correlations.

Pair entries          Correlation   Correlation w/o Pro
q vs. W*                 0.997            0.997
q vs. Q                  0.753            0.775
W* vs. Q                 0.743            0.767
q vs. Delta G            0.836            0.880
W* vs. Delta G           0.820            0.866
Q vs. Delta G            0.843            0.839
q vs. xi Q*              0.949            0.949
W* vs. xi Q*             0.932            0.933
Delta G vs. xi Q*        0.890            0.923